The aim of this script is to perform a statistical analysis of untargeted metabolomics data which has been previously pre-treated with the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”. This performs several tasks to achieve this end:
Read the raw data files from the input folder.
Normalize data set with the package “NormalyzerDE” without the need of a design input (it is created with the information given in the input files).
After the review and evaluation of the normalization report by the researcher, the selected data normalized is uploaded by changing line 178 with the corresponding file. You can choose among the following options: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt. If you do not want to use any normalized file, then set it as “None”.
Data visualization before and after the normalization with box plot from “NormalyMets” package.
Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) from packages “FactoMineR” and “factoextra” for normalized dataset to check the right group differentiation and a possible batch effect.
Statistical analysis, using the “Limma” package, performed to extract the p-values and log2 fold change values (LFC) for every metabolite in the conducted comparative analysis. Additionally, p-value adjustment methods, such as Benjamini-Hochberg, Bonferroni, and q-values were applied. Tables with the results are given, as well as plots displaying the original p-values and q-values, which are available for visualization.
The identification of significant metabolites will be based on the statistical thresholds set at point 4 of the usage instructions. A message will be displayed indicating the number of decreased and increased metabolites for each adjustment method, taking into account the linear model used in the Limma analysis. This will provide insights into the directionality of differential expression for the identified metabolites. [As the test model was set up as_“contrast <- paste0(group[2],”-“,group[1])”. _This instruction represent the measures difference in expression levels between Group 2 and Group 1. A positive LFC indicates higher expression in Group 2 compared to Group 1, while a negative LFC indicates higher expression in Group 1 compared to Group 2. The magnitude of the LFC represents the change in expression between the two groups.] The number of significant differences will be represented by venn-diagrams (“ggVennDiagram” package) and upset plot (“UpSetR” package).
After having selected a p-value adjustment method in line 750, the significant metabolites based on the statistical thresholds will be plotted by Mean-Average plots and Volcano plots, also a dynamic volcano plot is displayed where you can see the information for each dot (metabolite). Tables with the 10 most significant metabolites for each comparative will be displayed.
Finally, a sparse Partial Least Squares Discriminant Analysis (sPLS-DA) will be conducted using the “mixOmics” package. This analysis uses 2 components and 25 variables. The results for the sPLS-DA analysis include a loadings plot, individual samples plot, variable plot, and a prediction essay. Additionally, predictions results and the top 10 most important variable predictors for each comparative will be presented in tables, providing valuable insights into the predictive power of the model and the variables driving the differences between the comparatives.
To run this script, you need to provide the input and output folder addresses, as well as the statistical thresholds.
rm(list = ls())
Input data must proceeded from the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”, because they must have the words “one_factor” in their name.
You will need the following packages
packages <- c("tidyverse", "plotly", "mixOmics", "devtools", "NormalizeMets", "tools", "kableExtra",
"qvalue", "NormalyzerDE", "limma", "FactoMineR", "factoextra", "ggVennDiagram", "UpSetR", "ggpubr",
"ggrepel")
for (package in packages) {
if (!requireNamespace(package, quietly = TRUE)) {
if (package == "NormalizeMets") {
devtools::install_github("metabolomicstats/NormalizeMets")
} else {
install.packages(package)
}
}
library(package, character.only = TRUE)
}
# Please indicate if your data NA values have been impute in the previous script or not. Yes/No
imputed <- "No"
if (imputed == "Yes") {
input_folder <- "../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use"
dir.create(paste0(input_folder, "/Normalyze_data"))
output_folder <- "../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data"
dir.create("../Images/Imputed_separate_comparatives")
image_folder <- "../Images/Imputed_separate_comparatives"
} else {
input_folder <- "../1_Data/Untargeted_metabolomic_data/Common_ready_to_use"
dir.create(paste0(input_folder, "/Normalyze_data"))
output_folder <- "../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data"
dir.create("../Images/Common_metabolites")
image_folder <- "../Images/Common_metabolites"
}
# Set Normalize method on line 183 #
# Set adjusted p-value on line 750 #
# Set the statistical parameters for the significant analysis
stats_values <- list()
LFC_limit <- "LFC_limit"; stats_values[[LFC_limit]] <- 2
alpha_value <- "alpha_value"; stats_values[[alpha_value]] <- 0.01
files <- list.files(input_folder)
files <- files[grepl("one_factor", files, ignore.case = T)]
To ensure the correc organization of the comparative A columns and accurate assignment of the comparative E groups, it is necessary to execute this specific code section for my data.
for (i in files){
# Read data files
data_path <- file.path(input_folder, i)
data <- read.csv(data_path)
group <- data [1, ] %>% .[-1] %>% as.character(.) %>% unique(.)
# Perform the changes if necessary
if (group[1] == "cA_S17" & group[2] == "cA_S25") {
data <- data[, c(1, 5:7, 2:4)]
# Overwrite the original file
write.csv(data, file = data_path, row.names = FALSE)
} else if (group[1] == "cA_S17" & group[2] == "cB_OTA_S17") {
group <- c("group", rep("cE_noOTA", 3), rep("cE_OTA", 3))
data <- data [-1, ] %>% rbind(group, .)
# Overwrite the original file
write.csv(data, file = data_path, row.names = FALSE)
}
}
files <- list.files(input_folder)
files <- files[grepl("one_factor", files, ignore.case = T)]
This code section iterates over a list of files, reads each file’s data, prepares the dataset by assigning row names and extracting group information, writes the design and raw data to separate files, applies the normalyzer function for normalization, saves the normalized result, and finally removes temporary files.
for (i in files){
# Read data files
data_path <- file.path(input_folder, i)
raw_data <- read.csv(data_path)
# Prepare dataset
rownames(raw_data) <- raw_data$name
group <- raw_data[1, -1] %>% as.character()
raw_data <- raw_data[-1, -1]
# Design and temporal objects
design <- data.frame(sample=colnames(raw_data), group=group)
write.table(x = design, file = paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"),
quote = F, row.names = F, sep = "\t")
write.table(x = raw_data, file = paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"),
quote = F, row.names = F, sep = "\t")
# NormalyzerDE aplication
suppressMessages(normalyzer(jobName = paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
designPath = paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"),
dataPath = paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"),
outputDir = output_folder))
cat(sprintf("File %s has been Normalyzed and the result has been saved in %s.\n", i,
paste0(output_folder, "/", sub("_.*.?", "", i), "_NormalyzerDE\n")))
unlink(paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"), recursive = T)
unlink(paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"), recursive = T)
}
For the example data that was used to optimize this script and develop the corresponding Final Master Project, you can refer to the normalization summaries in the following files.
Normalized summary of Comparative A
Normalized summary of Comparative B
Normalized summary of Comparative C
Normalized summary of Comparative D
Normalized summary of Comparative E
# You can choose between the followings data normalized files: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt.
# If you do not want to use any normalized file set "None"
norm_selected <- "/Quantile-normalized.txt"
if (norm_selected != "None") {
for (i in files){
# Set the new name for the normalized data that will be analysed
name <- paste0(sub("_.*.?", "", i), "_normalized")
data <- read.table(paste0(output_folder, "/", paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
norm_selected), header = T)
# Rename rows according to their original names in the not normalized data. It is posible because the order is mainteined during the normalization process but
data2 <- read.csv(paste0(input_folder, "/", i))
rownames(data2) <- data2$name
group <- data2[1, -1] %>% rownames_to_column(); data2 <- data2[-1, -1]
row.names(data) <- row.names(data2)
# Reestructure the dataset by rearranging the columns
data <- rownames_to_column(data, var = "name")
data <- rbind(as.character(group), data)
data[is.na(data)] <- 0
assign(name, data)
write.csv(data, file = paste0(output_folder, "/", paste0(sub("_.*.?", "", i),
"_normalized.csv")), row.names = FALSE)
cat(sprintf("File %s normalized by %s method has been saved in %s.\n", i,
sub(".*/(.*?)\\-.*", "\\1", norm_selected),
paste0(output_folder, "/", paste0(sub("_.*.?", "", i), "_normalized.csv \n"))))
}
files2 <- ls()
files2 <- files2[grepl("normalized", files2, ignore.case = T)]
for (i in files){
# Set the new name for the normalized data that will be analysed
name <- paste0(sub("_.*.?", "", i), "_no_normalized")
data_path <- file.path(input_folder, i)
data <- read.csv(data_path)
assign(name, data)
files <- ls()
files <- files[grepl("no_normalized", files, ignore.case = T)]
}
} else {
for (i in files){
# Set the new name for the normalized data that will be analysed
name <- paste0(sub("_.*.?", "", i), "_no_normalized")
data_path <- file.path(input_folder, i)
data <- read.csv(data_path)
assign(name, data)
files <- ls()
files <- files[grepl("no_normalized", files, ignore.case = T)]
}
}
## File compA_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compA_normalized.csv
## .
## File compB_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compB_normalized.csv
## .
## File compC_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compC_normalized.csv
## .
## File compD_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compD_normalized.csv
## .
## File compE_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compE_normalized.csv
## .
A box plot is a visual summary of a dataset’s distribution. It helps to identify the center, spread, and skewness of the data. To assess data normality using a box plot, look for symmetry, absence of outliers, balanced box length, symmetric whisker length, and a median close to the box center.
This boxplot has been made thanks to “normalizeMets” package, which use ggplot2 graphs. In raw data case, it was log transformed to set similar axis as the ones of normalized data.
plots1 <- list()
plots2 <- list()
if (norm_selected != "None") {
for (i in files){
data <- get(i)
# Prepare dataset without normalization
group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
met_names <- data$name %>% .[-1]
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
# Original data visualization (log transformed and numeric format)
plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F,
plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
savetype = "png", interactiveplot = T)
index <- length(plots1) + 1
plots1[[index]] <- plot
}
for (i in files2){
data <- get(i)
# Prepare dataset normalized
group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
met_names <- data$name %>% .[-1]
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1]
data <- data %>% t() %>% `colnames<-`(met_names)
# Normalized data visualization (log transformed and numeric format)
plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F,
plotname = paste0("BoxPlot of ", sub("_.*.?", "", i),
" normalized by ", sub(".*/(.*?)\\-.*", "\\1", norm_selected), " method"),
savetype = "png", interactiveplot = T)
index <- length(plots2) + 1
plots2[[index]] <- plot
}
} else {
for (i in files){
data <- get(i)
# Prepare dataset without normalization
group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
met_names <- data$name %>% .[-1]
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
# Original data visualization (log transformed and numeric format)
plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F,
plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
savetype = "png", interactiveplot = T)
index <- length(plots2) + 1
plots2[[index]] <- plot
}
}
Principal Component Analysis (PCA) is a statistical technique that seeks to reduce the dimensionality of a dataset by maximizing the variance. It achieves this by transforming the original variables into a smaller set of uncorrelated variables called principal components. The goal is to retain as much relevant information as possible while minimizing the loss of information. PCA helps simplifying complex datasets, visualize data, and identify the most important patterns or features contributing to the overall variance.
if (norm_selected == "None") {
for (i in files) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,])
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# PCA performance
tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
## Boxplot
tmp_eig <- tmp_pca$eig %>% .[1:5, ]
barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
main = paste("PCA of", sub("_.*.?", "", i), "no normalized"),
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="gold")
## Graph of individuals
plot <- fviz_pca_ind(tmp_pca, col.ind = group,
pointsize=1, pointshape=21,fill="black",
repel = T,
addEllipses = T, ellipse.type = "confidence",
legend.title ="Conditions",
show_legend = TRUE, show_guide = TRUE)
labels <- rownames(data)
plot <- plot +
ggtitle(paste("PC1 & PC2 of", sub("_.*.?", "", i), "no normalized")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0(image_folder, "/PCA_plot_of_", sub("_.*.?", "", i), "_no_normalized.png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
} else {
for (i in files2) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,]); names <- colnames(data)
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# PCA performance
tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
## Boxplot
tmp_eig <- tmp_pca$eig %>% .[1:5, ]
barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
main = paste("PCA of", sub("_.*.?", "", i),
"normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"),
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="gold")
## Graph of individuals
plot <- fviz_pca_ind(tmp_pca, col.ind = group,
pointsize=1, pointshape=21,fill="black",
repel = T,
addEllipses = T, ellipse.type = "confidence",
legend.title ="Conditions",
show_legend = TRUE, show_guide = TRUE)
labels <- rownames(data)
plot <- plot +
ggtitle(paste("PC1 & PC2 of", sub("_.*.?", "", i),
"normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0(image_folder, "/PCA_plot_of_", sub("_.*.?", "", i), ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
}
HCPC (Hierarchical Clustering on Principal Components) is a statistical method that combines hierarchical clustering and principal component analysis (PCA). It is used for exploratory analysis and clustering of multivariate data.
HCPC first performs PCA on the dataset to obtain principal components that capture the variability in the data. Then, it applies hierarchical clustering on these principal components to group similar observations together based on their distance or similarity measures.
HCPC is particularly useful when dealing with high-dimensional datasets, as it allows for a comprehensive exploration of the data by revealing the underlying clusters or patterns. It can assist in identifying subgroups within a dataset and gaining a better understanding of the relationships between variables.
if (norm_selected == "None") {
for (i in files) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,])
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# HCA performance
res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)
plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet",
label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
type="rectangle", labels_track_height = 1000, horiz = T,repel = T) +
ggtitle(paste("Cluster Dendrogram of", sub("_.*.?", "", i), "no normalized")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0(image_folder, "/HCPC_plot_of", sub("_.*.?", "", i), "_no_normalized.png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
} else {
for (i in files2) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,]); names <- colnames(data)
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# HCA performance
res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)
plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet",
label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
type="rectangle", labels_track_height = 1000, horiz = T, repel = T) +
ggtitle(paste("Cluster Dendrogram of", sub("_.*.?", "", i),
"normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0(image_folder, "/HCPC_plot_of_", sub("_.*.?", "", i), ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
}
The p-value is a measure used in statistical hypothesis testing to determine the significance of a comparison. When the p-value is smaller than a significance level “α”, it indicates that there is sufficient evidence to reject the null hypothesis (H0). Most usually “α” value use to be 0.05 (5%) or 0.01 (1%).
In this research, where “α” = 0.05, the null hypothesis suggests that there are no significant differences between the amounts of the metabolite in each condition. By rejecting the H0 and accepting the alternative hypothesis (H1), we conclude that there are statistically significant differences in the amounts of the metabolite between the conditions being compared.
* H0 —> there is no difference between the sample groups for the amounts of the metabolites. * H1 —> There is a difference between the sample groups for the amounts of the metabolites.
This indicates a significant difference in the metabolite’s amount between two different conditions. By fixing an LFC value of 1 or -1, we will consider as statistically significant only those metabolites that have a difference in amount of at least two-fold or greater in either direction.
If we fix an LFC value of 2 or -2, we will be more restrictive and consider only those metabolites that have a difference in amount of at least four-fold or greater in either direction. This approach will capture larger and more pronounced changes in metabolite levels.
files2 <- vector()
for (i in files){
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
met_names <- row.names(data) %>% .[-1]
group_rep <- as.character(data[1 ,]) %>% table(.)
group <- as.character(data[1 ,]) %>% unique(.)
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% `row.names<-` (., met_names)
# Prepare limma assay
experimental.design <- model.matrix(~ -1+factor(c(rep(1, group_rep[[1]]),rep(2, group_rep[[2]])))) %>%
`colnames<-`(., group)
# Limma assay (condition 2 vs. condition 1)
linear.fit <- lmFit(data, experimental.design)
contrast <- paste0(group[2],"-",group[1])
contrast.matrix <- makeContrasts(contrast = contrast, levels = group)
contrast.linear.fit <- contrasts.fit(linear.fit, contrast.matrix)
contrast.results <- eBayes(contrast.linear.fit)
# New dataframe with stats
data_stats <- topTable(contrast.results, number = nrow(data), coef = 1)
new_name <- paste0(sub("_.*.?", "", i), "_stats")
assign(new_name, data_stats)
files2 <- append(files2, new_name)
output <- paste(rownames(data_stats[1, ]), "has the following statistics",
paste(colnames(data_stats), data_stats[1, ], sep = ": ", collapse = ", "))
cat(sprintf("The file %s has been processed using the Limma package, comparing condition %s against condition %s, and the statistical information obtained has been saved in the file %s. Here is an example of the information it contains: metabolite %s\n", i, group[2], group[1], new_name, output))
cat("\n")
}
## The file compA_normalized has been processed using the Limma package, comparing condition cA_S17 against condition cA_S25, and the statistical information obtained has been saved in the file compA_stats. Here is an example of the information it contains: metabolite Pseudouridine has the following statistics logFC: -2.8776344198033, AveExpr: 19.2391476963404, t: -28.6142616251652, P.Value: 3.53124446142366e-07, adj.P.Val: 8.22779959511713e-05, B: 7.73818466181533
##
## The file compB_normalized has been processed using the Limma package, comparing condition cB_OTA_S17 against condition cB_noOTA_S25, and the statistical information obtained has been saved in the file compB_stats. Here is an example of the information it contains: metabolite thiodiglycol has the following statistics logFC: -4.02354883533047, AveExpr: 20.6764256590878, t: -33.0512253390896, P.Value: 6.35913284170357e-08, adj.P.Val: 1.0738125506286e-05, B: 9.23023126088469
##
## The file compC_normalized has been processed using the Limma package, comparing condition cC_OTA against condition cC_noOTA, and the statistical information obtained has been saved in the file compC_stats. Here is an example of the information it contains: metabolite (S,S)-Anacine has the following statistics logFC: 11.7537798899378, AveExpr: 20.7159678775536, t: 42.4798956282828, P.Value: 7.70195141396493e-15, adj.P.Val: 4.08973620081538e-12, B: 21.8781524022808
##
## The file compD_normalized has been processed using the Limma package, comparing condition cD_OTA against condition cD_noOTA, and the statistical information obtained has been saved in the file compD_stats. Here is an example of the information it contains: metabolite SPF-32629A has the following statistics logFC: -6.7298492072946, AveExpr: 14.9705154269062, t: -48.5771686424746, P.Value: 1.36206814674928e-20, adj.P.Val: 9.65706316045242e-18, B: 34.2453574248732
##
## The file compE_normalized has been processed using the Limma package, comparing condition cE_OTA against condition cE_noOTA, and the statistical information obtained has been saved in the file compE_stats. Here is an example of the information it contains: metabolite 1-(4-Hydroxyphenyl)-2-aminoethanol has the following statistics logFC: -6.4448470289833, AveExpr: 22.9173710368963, t: -42.5519897138641, P.Value: 2.44718315116484e-08, adj.P.Val: 2.61958247328103e-06, B: 10.2286119933696
cat(sprintf("Vector files2 has the following information %s.\n", paste(files2, collapse = ", ")))
## Vector files2 has the following information compA_stats, compB_stats, compC_stats, compD_stats, compE_stats.
Multiple comparisons refer to the situation where several pairwise comparisons are made within a dataset or experiment. When conducting multiple comparisons, the probability of obtaining false positive results (Type I errors) increases. To address this issue, various statistical methods can be used to adjust the significance level (p-value) or control the overall error rate.
It is a statistical concept that refers to the probability of making at least one Type I error (rejecting a true null hypothesis) among a set of multiple hypothesis tests. The FWER is a measure that quantifies the overall error rate when multiple hypotheses are tested simultaneously.
It involves dividing the significance level (p-value) by the number of comparisons performed. For example, if there are 10 comparisons being made and a desired overall significance level of 5%, the p-value threshold would be adjusted to 0.005 (0.05 divided by 10). If the p-value obtained in a comparison is lower than this adjusted threshold, it is considered statistically significant. This method adjusts the significance level for each individual hypothesis to maintain the desired FWER.
The “BH” and “BY” methods of Benjamini, Hochberg, and Yekutieli control the false discovery rate (FDR), the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.
The Benjamini-Hochberg method is another technique used to control the type I error in multiple comparisons. Unlike Bonferroni, B-H is less conservative and can be more powerful in detecting statistically significant differences. Instead of adjusting each individual p-value, B-H controls the expected proportion of false discoveries among the rejected hypotheses. This is achieved by ordering the p-values from lowest to highest, calculating a critical value based on the desired FDR rate, and comparing it to each p-value in order. P-values below the critical value are considered statistically significant.
Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.
P-VALUE ADJUSTMENT
for (i in files2) {
message(paste("Processing file:", i))
data <- get(i)
data$p_value_bonferroni <- p.adjust(data$P.Value, method = "bonferroni", n = length(data$P.Value))
names(data)[names(data) == "adj.P.Val"] <- "p_value_BH"
new_name <- paste0("q_value_", sub("_.*.?", "", i))
tmp <- data$P.Value
if (i == "compC_stats"){
message(paste("Error in qvalue calculation for file:", i))
tmp_q_value <- qvalue(tmp, pi0 = 1)
message(paste("File", i, "processed successfully with pi0 = 1"))
} else {
tmp_q_value <- qvalue(tmp)
message(paste("File", i, "processed successfully"))
}
data$q_value <- tmp_q_value$qvalues
assign(new_name, tmp_q_value)
assign(i, data)
}
## Processing file: compA_stats
## File compA_stats processed successfully
## Processing file: compB_stats
## File compB_stats processed successfully
## Processing file: compC_stats
## Error in qvalue calculation for file: compC_stats
## File compC_stats processed successfully with pi0 = 1
## Processing file: compD_stats
## File compD_stats processed successfully
## Processing file: compE_stats
## File compE_stats processed successfully
| logFC | AveExpr | t | P.Value | p_value_BH | B | p_value_bonferroni | q_value | |
|---|---|---|---|---|---|---|---|---|
| Pseudouridine | -2.877634 | 19.23915 | -28.61426 | 3.5e-07 | 8.2e-05 | 7.738185 | 8.2e-05 | 4.7e-05 |
| Uracil;2_4-Dihydroxypyrimidine | -3.473787 | 23.17253 | -22.33493 | 1.4e-06 | 1.6e-04 | 6.395329 | 3.2e-04 | 9.1e-05 |
| 1-(4-Hydroxyphenyl)-2-aminoethanol | 2.165694 | 25.21030 | 18.68856 | 3.6e-06 | 2.8e-04 | 5.378592 | 8.3e-04 | 1.6e-04 |
| Asenjonamide C | 2.759073 | 23.48339 | 15.76866 | 8.9e-06 | 4.0e-04 | 4.383495 | 2.1e-03 | 2.3e-04 |
| [FAhydroxy(18:0)]12_13-dihydroxy-9Z-octadecenoicacid | 2.009285 | 22.92325 | 15.74301 | 9.0e-06 | 4.0e-04 | 4.373868 | 2.1e-03 | 2.3e-04 |
| N-Acetyl-beta-D-galactosamine | -1.107902 | 21.33208 | -14.88642 | 1.2e-05 | 4.0e-04 | 4.042318 | 2.8e-03 | 2.3e-04 |
| logFC | AveExpr | t | P.Value | p_value_BH | B | p_value_bonferroni | q_value | |
|---|---|---|---|---|---|---|---|---|
| thiodiglycol | -4.023549 | 20.67643 | -33.05123 | 6.4e-08 | 1.1e-05 | 9.230231 | 2.1e-05 | 3.5e-06 |
| 2-Oxovalericacid | 5.599484 | 21.30886 | 32.92127 | 6.5e-08 | 1.1e-05 | 9.210657 | 2.1e-05 | 3.5e-06 |
| Malyngolide | 2.714021 | 20.20551 | 22.59119 | 5.9e-07 | 6.5e-05 | 7.173198 | 1.9e-04 | 2.1e-05 |
| Hexadeca-5,9-dienoic acid | 2.160233 | 19.49440 | 19.31046 | 1.5e-06 | 1.2e-04 | 6.248042 | 4.9e-04 | 3.9e-05 |
| Sclerotigenin | 2.395235 | 20.99894 | 17.74552 | 2.4e-06 | 1.4e-04 | 5.736949 | 8.0e-04 | 4.4e-05 |
| Sulcatone | 2.701423 | 21.72628 | 17.66307 | 2.5e-06 | 1.4e-04 | 5.708569 | 8.2e-04 | 4.4e-05 |
| logFC | AveExpr | t | P.Value | p_value_BH | B | p_value_bonferroni | q_value | |
|---|---|---|---|---|---|---|---|---|
| (S,S)-Anacine | 11.753780 | 20.71597 | 42.47990 | 7.7e-15 | 4.1e-12 | 21.87815 | 4.1e-12 | 4.1e-12 |
| Tanikolide | 7.025822 | 17.89991 | 37.16378 | 4.0e-14 | 1.1e-11 | 20.91319 | 2.1e-11 | 1.1e-11 |
| 1,4-Anhydro-5-O-tetradecanoyl-D-glucitol | 8.088456 | 16.14435 | 35.70987 | 6.6e-14 | 1.2e-11 | 20.60305 | 3.5e-11 | 1.2e-11 |
| ProstaglandinF1a | 7.574203 | 16.54710 | 33.44378 | 1.5e-13 | 2.0e-11 | 20.07219 | 7.8e-11 | 2.0e-11 |
| 4,10-dihydroxy-10-methyl-dodecan-4-olide | 6.853155 | 17.48641 | 32.11155 | 2.4e-13 | 2.5e-11 | 19.73007 | 1.3e-10 | 2.5e-11 |
| Takinolide seco-acid | 5.883005 | 17.23582 | 31.71270 | 2.8e-13 | 2.5e-11 | 19.62293 | 1.5e-10 | 2.5e-11 |
| logFC | AveExpr | t | P.Value | p_value_BH | B | p_value_bonferroni | q_value | |
|---|---|---|---|---|---|---|---|---|
| SPF-32629A | -6.729849 | 14.97052 | -48.57717 | 1.4e-20 | 9.7e-18 | 34.24536 | 9.7e-18 | 5.0e-18 |
| 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine | -4.048766 | 20.74461 | -23.79326 | 4.4e-15 | 1.6e-12 | 24.43944 | 3.1e-12 | 8.0e-13 |
| 6-formylsalicylic acid | -6.304306 | 15.21251 | -21.50028 | 2.6e-14 | 6.1e-12 | 22.82836 | 1.8e-11 | 3.1e-12 |
| C17H37P3 | -5.007224 | 19.52357 | -17.72204 | 7.3e-13 | 1.3e-10 | 19.68782 | 5.2e-10 | 6.6e-11 |
| C31H42O3P2 | -4.530932 | 17.59934 | -16.75030 | 1.9e-12 | 2.7e-10 | 18.76162 | 1.3e-09 | 1.4e-10 |
| C6H16N6O4P2 | 2.178261 | 20.59727 | 16.19640 | 3.4e-12 | 4.0e-10 | 18.20842 | 2.4e-09 | 2.0e-10 |
| logFC | AveExpr | t | P.Value | p_value_BH | B | p_value_bonferroni | q_value | |
|---|---|---|---|---|---|---|---|---|
| 1-(4-Hydroxyphenyl)-2-aminoethanol | -6.444847 | 22.91737 | -42.55199 | 2.4e-08 | 2.6e-06 | 10.228612 | 4.6e-06 | 8.8e-07 |
| Asenjonamide C | -7.371831 | 20.86065 | -41.66516 | 2.8e-08 | 2.6e-06 | 10.126882 | 5.2e-06 | 8.8e-07 |
| C4H9N5O6 | 2.730207 | 19.94422 | 30.90247 | 1.5e-07 | 7.8e-06 | 8.571225 | 2.8e-05 | 2.6e-06 |
| (R)-4-Dehydropantoate | 3.828108 | 22.02130 | 30.38516 | 1.6e-07 | 7.8e-06 | 8.477877 | 3.1e-05 | 2.6e-06 |
| C30H52NP3 | 4.262037 | 19.66849 | 26.42986 | 3.6e-07 | 1.4e-05 | 7.688768 | 6.9e-05 | 4.6e-06 |
| 1(2H)-Isoquinolinone | -3.563701 | 21.78231 | -24.55777 | 5.5e-07 | 1.7e-05 | 7.261687 | 1.0e-04 | 5.8e-06 |
P-VALUE REPRESENTATION
for (i in files2){
# Q-value plots:
q_value_plot <- paste0("q_value_", sub("_.*.?", "", i))
q_value_plot <- get(q_value_plot)
plot(q_value_plot, main = paste("Q-value plots of file", i))
# P-values histograms:
data <- get(i)
hist(data$P.Value, main = paste("Original p-value of file", i),
xlab = "p-value", col = "gold")
}
## SIGNIFICANT METABOLITES in file compA_stats with a LFC value of 2 and a α value of 0.01.
##
## Total number of metabolites is 233.
##
## The significant metabolites in file compA_stats by original p-values obteined by limma are 16.
## - 7 are increased.
## - 9 are decreased.
## - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
##
## The significant metabolites in file compA_stats by Bonferroni p-values are 10.
## - 5 are increased.
## - 5 are decreased.
## - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
##
## The significant metabolites in file compA_stats by Benjamini-Hochberg p-values are 16.
## - 7 are increased.
## - 9 are decreased.
## - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
##
## The significant metabolites in file compA_stats by q-value testare 16.
## - 7 are increased.
## - 9 are decreased.
## - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
##
## |------------------------|
##
## SIGNIFICANT METABOLITES in file compB_stats with a LFC value of 2 and a α value of 0.01.
##
## Total number of metabolites is 330.
##
## The significant metabolites in file compB_stats by original p-values obteined by limma are 19.
## - 9 are increased.
## - 10 are decreased.
## - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid.
## - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
##
## The significant metabolites in file compB_stats by Bonferroni p-values are 12.
## - 8 are increased.
## - 4 are decreased.
## - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid.
## - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
##
## The significant metabolites in file compB_stats by Benjamini-Hochberg p-values are 18.
## - 9 are increased.
## - 9 are decreased.
## - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid.
## - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
##
## The significant metabolites in file compB_stats by q-value testare 19.
## - 9 are increased.
## - 10 are decreased.
## - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid.
## - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
##
## |------------------------|
##
## SIGNIFICANT METABOLITES in file compC_stats with a LFC value of 2 and a α value of 0.01.
##
## Total number of metabolites is 531.
##
## The significant metabolites in file compC_stats by original p-values obteined by limma are 160.
## - 80 are increased.
## - 80 are decreased.
## - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine.
## - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
##
## The significant metabolites in file compC_stats by Bonferroni p-values are 99.
## - 60 are increased.
## - 39 are decreased.
## - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine.
## - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
##
## The significant metabolites in file compC_stats by Benjamini-Hochberg p-values are 154.
## - 78 are increased.
## - 76 are decreased.
## - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine.
## - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
##
## The significant metabolites in file compC_stats by q-value testare 154.
## - 78 are increased.
## - 76 are decreased.
## - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine.
## - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
##
## |------------------------|
##
## SIGNIFICANT METABOLITES in file compD_stats with a LFC value of 2 and a α value of 0.01.
##
## Total number of metabolites is 709.
##
## The significant metabolites in file compD_stats by original p-values obteined by limma are 64.
## - 35 are increased.
## - 29 are decreased.
## - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2.
## - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
##
## The significant metabolites in file compD_stats by Bonferroni p-values are 33.
## - 14 are increased.
## - 19 are decreased.
## - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2.
## - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
##
## The significant metabolites in file compD_stats by Benjamini-Hochberg p-values are 55.
## - 28 are increased.
## - 27 are decreased.
## - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2.
## - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
##
## The significant metabolites in file compD_stats by q-value testare 60.
## - 32 are increased.
## - 28 are decreased.
## - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2.
## - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
##
## |------------------------|
##
## SIGNIFICANT METABOLITES in file compE_stats with a LFC value of 2 and a α value of 0.01.
##
## Total number of metabolites is 190.
##
## The significant metabolites in file compE_stats by original p-values obteined by limma are 29.
## - 14 are increased.
## - 15 are decreased.
## - The most significant metabolite increased in cE_OTA condition is C4H9N5O6.
## - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
##
## The significant metabolites in file compE_stats by Bonferroni p-values are 21.
## - 9 are increased.
## - 12 are decreased.
## - The most significant metabolite increased in cE_OTA condition is C4H9N5O6.
## - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
##
## The significant metabolites in file compE_stats by Benjamini-Hochberg p-values are 29.
## - 14 are increased.
## - 15 are decreased.
## - The most significant metabolite increased in cE_OTA condition is C4H9N5O6.
## - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
##
## The significant metabolites in file compE_stats by q-value testare 29.
## - 14 are increased.
## - 15 are decreased.
## - The most significant metabolite increased in cE_OTA condition is C4H9N5O6.
## - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
##
## |------------------------|
for (i in files2){
data <- get(i)
# Obtain the significant metabolites by original p-values
significant <- subset(data, (logFC > stats_values$LFC_limit & P.Value < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & P.Value < stats_values$alpha_value))
Original_mets <- row.names(significant)
# Obtain the significant metabolites by BH p-values
significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_BH < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value))
BH_mets <- row.names(significant)
# Obtain the significant metabolites by original p-values
significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value))
Bonferroni_mets <- row.names(significant)
# Obtain the significant metabolites by q-values
significant <- subset(data, (logFC > stats_values$LFC_limit & q_value < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & q_value < stats_values$alpha_value))
Qvalue_mets <- row.names(significant)
significant <- list(Original_mets = Original_mets, BH_mets = BH_mets, Bonferroni_mets = Bonferroni_mets,
Qvalue_mets = Qvalue_mets)
# Performance Venn diagram
random_color <- sample(colors(), 1)
plot <- ggVennDiagram(significant, label_alpha = 0.4) +
scale_fill_gradient(low="white", high = random_color) + labs(fill = "Significant \nmetabolites") +
ggtitle(paste("Venn Diagram of file", i,"with the significant \nmetabolites according to the p-value used")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
print(plot)
# Upset diagram
plot <- upset(fromList(significant),
number.angles = 0, point.size = 3, line.size = 1,
sets.x.label = "Number of significant \n metabolites",
mainbar.y.label = paste("Upset Diagram of file", i,
"\nwith the significant metabolites \naccording to the p-value used
\n\nIntersection Size"),
set_size.show = T, matrix.color = random_color,
set_size.scale_max = max(sapply(significant, length))+50,
text.scale = c(1.5, 1.5, 1, 1, 1.4, 2), order.by = "freq")
print(plot)
}
# Set the p_value adjustment method selected to use in the following steps. You can choose between: p_value_BH, p_value_bonferroni, q_value and P.Value
selected_p_adjustment <- "p_value_BH"
## Create a function that bind columns and complete the missing rows with NA values
cbind.fill <- function(...){
nm <- list(...)
nm<-lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x) {
rbind(x, matrix(NA, n-nrow(x), ncol(x)))
}))
}
for (i in files2) {
# Prepare datasets
data <- get(i)
names(data) <- gsub("logFC", "log2FoldChange", gsub("AveExpr", "baseMean", gsub(selected_p_adjustment, "padj", names(data))))
data <- data[c("log2FoldChange", "baseMean", "padj")]
# Save significant metabolites as .csv file, filling with NA values till the end of the dataframe.
## Select the desired metabolites to save
increased <- subset(data, log2FoldChange > stats_values$LFC_limit & padj < stats_values$alpha_value) %>%
rownames_to_column() %>% .[, 1] %>% as.data.frame()
decreased <- subset(data, log2FoldChange < - stats_values$LFC_limit & padj < stats_values$alpha_value) %>%
rownames_to_column() %>% .[, 1] %>% as.data.frame()
## Save data
significant_data <- cbind.fill(increased, decreased) %>% `colnames<-`(lm_contrast[sub("_.*.?", "", i)][[1]])
write.csv(significant_data, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i),
"_significant_mets.csv")), row.names = FALSE)
# Performance MA plot
plot <- ggmaplot(
data, fdr = stats_values$alpha_value, fc = stats_values$LFC_limit,
genenames = row.names(data), detection_call = NULL, size = 1.5,
alpha = 1, seed = 42, font.label = c(10, "italic", "black"),
label.rectangle = F, palette = c("#B31B21", "#1465AC", "darkgray"),
top = 4, select.top.method = c("padj", "fc"), label.select = NULL,
main = NULL, xlab = "Log2 mean expression", ylab = "Log2 fold change",
ggtheme = theme_pubr()) +
ggtitle(paste0("MA plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
" Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0(image_folder, "/MA_plot_of_", sub("_.*.?", "", i), ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
j = 0
plots <- list()
tables1 <- list()
RDS_files <- list()
for (i in files2){
data <- get(i) %>% rownames_to_column("Metabolites") %>% .[c("Metabolites", "logFC", selected_p_adjustment)]
data <- data %>%
mutate(Significant = ifelse(abs(logFC) < stats_values$LFC_limit & p_value_BH > stats_values$alpha_value, "ns",
ifelse(logFC >= stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "up",
ifelse(logFC <= -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "down",
"ns"))))
# Make a table with the 10 most significant metabolites to each condition
mets <- bind_cols(data %>% arrange(desc(Significant)) %>% head(10) %>% .$Metabolites,
data %>% arrange(Significant) %>% head(10) %>% .$Metabolites) %>%
`colnames<-`(c(lm_contrast[sub("_.*.?", "", i)][[1]][1], lm_contrast[sub("_.*.?", "", i)][[1]][2]))
## Save data set in a list with the correspondence name
RDS_files[[sub("_.*.?", "", i)]] <- mets
j <- j + 1
table <- mets %>%
kbl(caption = paste("Table", j,": The 10 most significant metabolites to each condition of",
sub("_.*.?", "", i), "are: ")) %>%
kable_classic(full_width = F, html_font = "Cambria")
index <- length(tables1) + 1
tables1[[index]] <- table
# Make a dataframe with the most significant metabolites, in order to fix them in the volcano plot
mets <- list()
mets <- bind_rows(data %>% arrange(desc(Significant)) %>% head(5),
data %>% arrange(Significant) %>% head(5))
plot <- data %>%
ggplot(aes(x=logFC, y=-log10(p_value_BH), color=Significant,
text=paste0("</br>Metabolite: ", Metabolites,
"</br>Adj p-value: ", p_value_BH,
"</br>LFC: ", logFC))) +
geom_point(alpha=0.75, shape=16) + xlim(-15,15) +
theme_gray() + theme(legend.position = "bottom") +
ggtitle(paste0("Volcano plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
" Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold")) +
labs(color = "Significant \nMetabolite") +
scale_color_manual(values = c("up" = "#faaa11", "down" = "#26b3ff", "ns" = "grey")) +
geom_point(data = mets, aes(color = Significant), size = 4) +
geom_text_repel(data = mets, aes(label = Metabolites), size = 3, color = "black")
print(plot)
ggsave(filename = paste0(image_folder, "/Volcano_plot_of_", sub("_.*.?", "", i), ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
index <- length(plots) + 1
plots[[index]] <- plot
}
## New names:
## New names:
## • `` -> `...1`
## • `` -> `...2`
## New names:
## • `` -> `...1`
## • `` -> `...2`
## New names:
## • `` -> `...1`
## • `` -> `...2`
## New names:
## • `` -> `...1`
## • `` -> `...2`
#if (imputed == "Yes") {
# saveRDS(RDS_files, file = "../../Tables_TFM/significant_mets.rds")
#} else {
# saveRDS(RDS_files, file = "../../Tables_TFM/significant_mets_common.rds")
#}
tables1[[1]]
| cA_S17 | cA_S25 |
|---|---|
| 1-(4-Hydroxyphenyl)-2-aminoethanol | Pseudouridine |
| Asenjonamide C | Uracil;2_4-Dihydroxypyrimidine |
| [FAhydroxy(18:0)]12_13-dihydroxy-9Z-octadecenoicacid | L-Phenylalanine |
| Benzylamine | Cytisine |
| Geranylacetate | Peniciginseng B |
| Photopyrone G | 6-heptyl-2H-pyran-2-one |
| 10_16-Dihydroxyhexadecanoicacid | ?-(Methylenecyclopropyl)glycine |
| N-Acetyl-beta-D-galactosamine | (-)-Jasmonicacid |
| Palmitoleicacid | N-(6-Aminohexanoyl)-6-aminohexanoate |
| (-)-alpha-Cedrene | N-Acetyl-beta-D-galactosamine |
tables1[[2]]
| cB_OTA_S17 | cB_noOTA_S25 |
|---|---|
| 2-Oxovalericacid | thiodiglycol |
| Malyngolide | allopurinol |
| Hexadeca-5,9-dienoic acid | N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide |
| Sclerotigenin | Uracil;2_4-Dihydroxypyrimidine |
| Sulcatone | Xanthine |
| 2’-Deoxyinosine | Pseudouridine |
| (S,S)-Anacine | 5-hydroxymethylfurfural |
| 2-{4-(Cyclohexylamino)-1-methyl-1H-pyrazolo[3,4-d]pyrimidin-6-ylamino}ethanol | 2_2’-Iminodipropanoate |
| Aflatoxin b2 | Norepinephrine(Noradrenaline) |
| C12H29N4P | C12H29N4P |
tables1[[3]]
| cC_OTA | cC_noOTA |
|---|---|
| (S,S)-Anacine | C17H35P3 |
| Tanikolide | Pyrenochaetic acid F |
| 1,4-Anhydro-5-O-tetradecanoyl-D-glucitol | 3-{2-[2-(3-Azidopropoxy)ethoxy]ethoxy}-1-propanamine |
| ProstaglandinF1a | Inosine |
| 4,10-dihydroxy-10-methyl-dodecan-4-olide | allopurinol |
| Takinolide seco-acid | C15H33P3 |
| N-{2-Methyl-5-[(3-methylbutanoyl)amino]phenyl}-9-oxooctahydro-2H-pyrazino[1,2-a]pyrazine-2-carboxamide | PALGLY |
| ethyl 9,9-dimethoxynonanoate | C4H9N8O5P |
| Neolymphostinol C | Methyl N6-acetyl-L-lysinate |
| 2’-Deoxyinosine | Appenolide C |
tables1[[4]]
| cD_OTA | cD_noOTA |
|---|---|
| C6H16N6O4P2 | SPF-32629A |
| 6-Methoxy-8<6’-hydroxy-2’-methoxy-4’(3’‘-methyl-2’’-butenyl)phenoxy>2,2-dimethylchromene | 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine |
| N5-Ethyl-L-glutamine | 6-formylsalicylic acid |
| Acetyl-L-carnitine | C17H37P3 |
| 3-Methylbut-2-enal | C31H42O3P2 |
| 1-Pentylamine | 1-O-Tetradecanoyl-D-glucitol |
| 6,7-dihydroxy-4,5,6,7-tetrahydroindole-4-one | Pentaerythritol laurate |
| C21H38N5O7P | C4H9N8O5P |
| Asperterrestide A | 8-(4-Ethyl-1-piperazinyl)-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione |
| 2-Methyl-2-propanyl 1,4,10-trioxa-7,13-diazacyclopentadecane-7-carboxylate | allopurinol |
tables1[[5]]
| cE_OTA | cE_noOTA |
|---|---|
| C4H9N5O6 | 1-(4-Hydroxyphenyl)-2-aminoethanol |
| (R)-4-Dehydropantoate | Asenjonamide C |
| C30H52NP3 | 1(2H)-Isoquinolinone |
| 3-Dehydrocarnitine | Maristachone B |
| 1_6_6-Trimethyl-2_7-dioxabicyclo[3.2.2]nonan-3-one | Capsidiol |
| Tanikolide | N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide |
| 1_2_4-Trimethylbenzene | Phloionolic acid |
| 2-Oxo-delta3-4_5_5-trimethylcyclopentenylacetate | (6RS,10RS)-6,10-dimethylbicyclo[4.4.0]dec-1-en-3-one |
| [STtrihydrox]3alpha_11beta_21-5alpha-trihydroxy-pregnane-20-one | 8-{[2-(Diethylamino)ethyl]amino}-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione |
| N-(2-Hexyl-4-oxo-3(4H)-quinazolinyl)-4-hydroxy-1-isobutyl-2-oxo-1,2,5,6,7,8-hexahydro-3-quinolinecarboxamide | AminoDHQ |
PLS-DA is a statistical method used for classification. It helps to predict or classify samples into different groups based on a set of variables. It is particularly useful for high-dimensional data like metabolomics. By applying PLS-DA, you can identify the important variables that distinguish between different groups, providing valuable insights for further analysis.
In PLS-DA, the method combines the principles of Partial Least Squares (PLS) regression and Discriminant Analysis. It aims to find a linear combination of the predictor variables that maximally explains the variation in the response variable while also maximizing the separation between different groups or categories.
In this case, given the large number of variables available, we decided to perform a sparse PLS-DA analysis. We limited the analysis to 2 components based on the good separation achieved in the PCA assay. Additionally, we reduced the number of variables to 25.
The prediction capacity of the “pls” model will be analysed by using the “predict” function from mixOmics package. The algorithim used to identify the similarities between the model and the new data is the Mahalanobis distance.
tables1 <- list()
tables2 <- list()
RDS_files <- list()
RDS_files1 <- list()
for (i in files) {
j <- j + 1
data <- get(i)
Y <- as.character(data[1 ,]) %>% .[-1] %>% as.factor(.)
mets <- (data[, 1]) %>% .[-1] %>% as.data.frame() %>% `colnames<-`(., "Mets_names") %>%
mutate(ID = paste0("Metabolite_", 1:(nrow(data)-1)))
write.csv(mets, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i), "_mets_names.csv")), row.names = FALSE)
data <- data[-1, -1] %>% mutate_all(as.numeric) %>% t() %>% `colnames<-`(., mets$ID)
# Prepare data for the sPLS-DA
## Remove variables with zero standard deviation
sd_values <- apply(data, 2, sd)
filtered_data <- data[, sd_values > 0] %>% as.data.frame(); X <- filtered_data
# Perform sPLS-DA
tmp_pls <- mixOmics::splsda(X, Y, ncomp = 2,
keepX = 25,
scale = FALSE)
# Prediction capacity
## Obtain the mean sd value calculated per sample in the dataset
mean_sd <- mean(apply(X, 1, sd))
## Randomly select 5 observations from the X matrix
set.seed(123)
s <- sample(1:nrow(X), 5)
Xnew = X[s, , drop = FALSE]
## Modify original samples
Xnew = t(apply(Xnew, 1, function(x){x + rnorm(ncol(X), 0, mean_sd/10)}))
## Perform de prediction
myprediction = predict(tmp_pls, Xnew, dist = "mahalanobis.dist")
## Save data set in a list with the correspondence name
RDS_files[[sub("_.*.?", "", i)]] <- myprediction$class
table <- myprediction$class %>%
kbl(caption = paste("Table", j,": The predicted classes for the randomly and edited samples of",
sub("_.*.?", "", i), "are: ")) %>%
kable_classic(full_width = F, html_font = "Cambria")
index <- length(tables1) + 1
tables1[[index]] <- table
# The 10 most important predictor variables selection to each group
## Those with positive weight in component 1 -> group_1
group_1 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(desc(comp1)) %>%
head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
## Those with negative weight in component 1 -> group_2
group_2 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(comp1) %>%
head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
## If the sign of the weight for group 1 in the first component is negative, it means that the metabolites associated with group 1 by the sPLS-DA technique will have a negative weight. On the other hand, if the sign of the weight for group 1 is positive, it indicates that the metabolites associated with group 1 by the sPLS-DA technique will have a positive weight.
if (sign(tmp_pls$loadings$Y[1,1]) < 0){
group <- tmp_pls$loadings$Y %>% row.names()
} else {
group <- tmp_pls$loadings$Y %>% row.names() %>% rev()
}
VIP_metabolites <- cbind(group_2, group_1) %>% `colnames<-`(., group)
table <- VIP_metabolites %>%
kbl(caption = paste("Table", j+5,": The 10 most important predictor variables of",
sub("_.*.?", "", i), "are: ")) %>%
kable_classic(full_width = F, html_font = "Cambria")
index <- length(tables2) + 1
tables2[[index]] <- table
# Save PLS variables loadings in RDS files
## Change ID by their metabolite name
pls_loadings <- tmp_pls$loadings$X %>% .[, 1] %>% as.data.frame()
new_row_names <- rownames(pls_loadings) %>% {mets$Mets_names[match(., mets$ID)]}
rownames(pls_loadings) <- new_row_names
## Split the data frame in a positive and negative one. Without 0 weight
positive_weight <- subset(pls_loadings, . > 0) %>% arrange(.) %>% `colnames<-`(., group[2])
negative_weight <- subset(pls_loadings, . < 0) %>% arrange(.) %>% `colnames<-`(., group[1])
## Save both data set in a list with the correspondence name
list_1 <- list()
list_1[["positive_weight"]] <- positive_weight; list_1[["negative_weight"]] <- negative_weight
RDS_files1[[sub("_.*.?", "", i)]] <- list_1
# GRAPHS
## Loadings
plotLoadings(tmp_pls, comp = 1, method = 'mean', contrib = 'max',
title = paste("Loadings of the 25 principal \nmetabolites of", sub("_.*.?", "", i)))
## Variables (Metabolites)
plotVar(tmp_pls, comp = c(1,2),
var.names = F, cex = 3, cutoff = 0.8,
title = paste("Correlation plot of the 25 \nprincipal metabolites of", sub("_.*.?", "", i)))
## HeatMap
#Execute directly in console
X11()
cim(tmp_pls, comp = 1,
xlab = "Metabolites", ylab = "Samples", margins = c(7, 15),
title = paste("Heatmap of", sub("_.*.?", "", i),
"\nnormalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"), lwid = c(0.1, 0.9))
dev.off()
## Individual samples per component
my_colors <- c("steelblue3", "green3")
plotIndiv(tmp_pls, comp = 1:2,
group = Y, col = my_colors, style = "ggplot2",
title = paste("PC1 & PC2", i),
legend = TRUE, legend.position = "right",
legend.title = "Sampling condition", ellipse = TRUE,
ellipse.level = 0.95, centroid = FALSE)
}
#if (imputed == "Yes") {
# saveRDS(RDS_files, file = "../../Supplementary_material/Supplementary_tables/predictions_results.rds")
# saveRDS(RDS_files1, file = "../../Tables_TFM/loadings_pls_comparatives.rds")
#} else {
# saveRDS(RDS_files, file = "../../Supplementary_material/Supplementary_tables/predictions_results_common.rds")
# saveRDS(RDS_files1, file = "../../Tables_TFM/loadings_pls_comparatives_common.rds")
#}
|
|
|
|
|
| cA_S17 | cA_S25 |
|---|---|
| Asenjonamide C | Cytisine |
| Photopyrone G | Uracil;2_4-Dihydroxypyrimidine |
| 1-(4-Hydroxyphenyl)-2-aminoethanol | L-Phenylalanine |
| Geranylacetate | Pseudouridine |
| Benzylamine | Peniciginseng B |
| 10_16-Dihydroxyhexadecanoicacid | ?-(Methylenecyclopropyl)glycine |
| [FAhydroxy(18:0)]12_13-dihydroxy-9Z-octadecenoicacid | (-)-Jasmonicacid |
| N-(5-Methyl-1,2-oxazol-3-yl)-4-[(phenylcarbamoyl)amino]benzenesulfonamide | N-(6-Aminohexanoyl)-6-aminohexanoate |
| NP-002322 | 6-heptyl-2H-pyran-2-one |
| C12H25N4P | NP-015145 |
| cB_noOTA_S25 | cB_OTA_S17 |
|---|---|
| thiodiglycol | 2-Oxovalericacid |
| N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide | Malyngolide |
| Xanthine | Sulcatone |
| allopurinol | 2-{4-(Cyclohexylamino)-1-methyl-1H-pyrazolo[3,4-d]pyrimidin-6-ylamino}ethanol |
| Uracil;2_4-Dihydroxypyrimidine | Aflatoxin b2 |
| Pseudouridine | Sclerotigenin |
| 5-hydroxymethylfurfural | (S,S)-Anacine |
| Confluenine E | 2’-Deoxyinosine |
| 2_2’-Iminodipropanoate | Hexadeca-5,9-dienoic acid |
| Norepinephrine(Noradrenaline) | Isoprene |
| cC_OTA | cC_noOTA |
|---|---|
| (S,S)-Anacine | Asenjonamide C |
| 2’-Deoxyinosine | C17H35P3 |
| Purine | (-)-alpha-Cedrene |
| Neolymphostinol C | (-)-Isopiperitenone |
| Aurantiomide C | (-)-Jasmonicacid |
| 1,4-Anhydro-5-O-tetradecanoyl-D-glucitol | (-)-monascumic acid |
| Methyl 16-hydroxy-12-oxohexadecanoate | (-)-trans-Carveol |
| Aurantiomide B | (+)-Bornane-2_5-dione |
| Sclerotigenin | (+)-ethyl homononactate |
| ProstaglandinF1a | (+)-exo-5-Hydroxycamphor |
| cD_OTA | cD_noOTA |
|---|---|
| Acetyl-L-carnitine | 1D-chiro-inositol |
| Ochratoxin A | SPF-32629A |
| 1-Pentylamine | 6-formylsalicylic acid |
| MFCD00054545 | [FAoxo_amino(6:0)]3-oxo-5S-amino-hexanoicacid |
| Adenine | C17H37P3 |
| Maniwamycin C | 8-(4-Ethyl-1-piperazinyl)-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione |
| 2’-Deoxyinosine | C31H42O3P2 |
| Purine | Pentaerythritol laurate |
| N5-Ethyl-L-glutamine | 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine |
| Cytisine | allopurinol |
| cE_OTA | cE_noOTA |
|---|---|
| Tanikolide | Asenjonamide C |
| C30H52NP3 | 1-(4-Hydroxyphenyl)-2-aminoethanol |
| (R)-4-Dehydropantoate | 1(2H)-Isoquinolinone |
| 2-Oxo-delta3-4_5_5-trimethylcyclopentenylacetate | N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide |
| 3-Dehydrocarnitine | Maristachone B |
| [STtrihydrox]3alpha_11beta_21-5alpha-trihydroxy-pregnane-20-one | Xanthine |
| Isoprene | 1-Methylpyrrolinium |
| C4H9N5O6 | AminoDHQ |
| 1_6_6-Trimethyl-2_7-dioxabicyclo[3.2.2]nonan-3-one | Phloionolic acid |
| C12H14N10 | 6,7-DIETHYL-1,1,4,4-TETRAMETHYLTETRALINE |
HeatMaps from the pls data. To see more information run the code in the console and consulte the graph.
## ######## ## ## ###### ###### ## ## ######
## ## ## ## ## ## ### ## ## ##
## ## ## ## ## ## #### ## ## ##
## ## ####### ###### ###### ## ## ## ## ##
## ## ## ## ## ## ## #### ## ##
## ## ## ## ## ## ## ### ## ##
## ## ## ## ###### ###### ## ## ######
## Script completed successfully!